Optimizing for SSE: A Case Study

Cort Stratton

Introduction

In this article I'll be describing the process of optimizing a piece of code to make use of Intel's SSE instruction set. As an example, I will work with a 4x4 matrix-vector multiplication function (commonly found in most 3D graphics engines). Starting from a naïve C++ implementation, I will progressively apply various optimization techniques, until the function is more than four times faster than its original version. The code samples are in C++ with inline Intel x86 assembly (MSVC syntax), but the techniques generalize easily to other environments and platforms.

We'll apply these techniques to a problem that often arises in computer graphics: multiplying a 3D vector by a 4x4 matrix. This comes up whenever you have a 3D vertex that you wish to transform using a standard 4x4 transformation matrix. Of course, all graphics cards since the original Nvidia GeForce perform final vertex transformations in hardware. But there are still plenty of reasons to be transforming your vertices before sending them to the graphics card (matrix composition, visibility culling and collision detection, to name a few - not to mention non-hardware-accelerated applications).

Naïve C++

So, hopefully you agree that it's important to have a fast software matrix-vector multiplier. Well, you're in luck - there are plenty of fast implementations available! In Hugi #20, Chris Dragan uses AMD's 3DNow extensions to write a matrix-vector multiplier that runs in 16 cycles! But…what if you don't have an Athlon (or what if you want your demo to run well on non-Athlons)? Well, not to worry - Intel provides a free matrix library, optimized for the Pentium 3 and 4! Unfortunately, to make use of the optimized versions of Intel's matrix functions, you need to be using Intel's proprietary compiler. Well, there's always libSIMD, a free cross-platform optimized matrix library. However, the libSIMD project is still in the early stages of development, and currently doesn't provide any optimization at all. So why not go dig out your Linear Algebra textbook, look up the definition of matrix multiplication, and write some C++ code! I suspect your first draft would look something like this:

// MatrixMultiply1 -- a naive C++ matrix-vector multiplication function.
// It's correct, but that's about the only thing impressive about it.
//
// Performance: ~90 cycles/vector
Vector4f MatrixMultiply1(Matrix4f &m, Vector4f &vin)
{
	float v0 =	m.elts[0][0]*vin[0] + m.elts[0][1]*vin[1] + 
							m.elts[0][2]*vin[2] + m.elts[0][3]*vin[3];
	float v1 =  m.elts[1][0]*vin[0] + m.elts[1][1]*vin[1] +
							m.elts[1][2]*vin[2] + m.elts[1][3]*vin[3];
	float v2 =  m.elts[2][0]*vin[0] + m.elts[2][1]*vin[1] +
							m.elts[2][2]*vin[2] + m.elts[2][3]*vin[3];
	float v3 =  m.elts[3][0]*vin[0] + m.elts[3][1]*vin[1] + 
							m.elts[3][2]*vin[2] + m.elts[3][3]*vin[3];
	return Vector4f(v0,v1,v2,v3);
}

A brief note on cycle timings: the timings I give for each function were achieved when running the sample code on my 500 MHz Pentium 3. Different machines will have different timings (for instance, the Pentium 4 offers significantly faster SIMD performance than the Pentium 3). But the absolute timings are mostly irrelevant - what matters is the relative timings, which show off the performance gains from each successive optimization.
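If you'd like to take your own measurements, the usual tool on Pentium-class CPUs is the rdtsc instruction, which reads the processor's cycle counter. The helper below is only a sketch (the name is mine, and it leaves out the cpuid serialization and overhead subtraction you'd want for really precise numbers):

// ReadCycleCounter (sketch) -- minimal cycle counting with rdtsc.
// rdtsc returns the 64-bit time-stamp counter in EDX:EAX.
inline unsigned __int64 ReadCycleCounter()
{
	unsigned int lo, hi;
	__asm {
		rdtsc
		mov		lo, eax
		mov		hi, edx
	}
	return ((unsigned __int64)hi << 32) | lo;
}

// usage: cycles per vector ~= (end - start) / number of vectors processed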

So sure, the function works - but if all you're looking for is code that WORKS, you wouldn't be reading Hugi, would you? Functionality isn't good enough - we want something FAST! And since Chris Dragan's 3DNow version runs more than 5 times faster, you may suspect (correctly, as it turns out) that we can make some pretty heavy optimizations. You can optimize the C++ all you want (see MatrixMultiply2 in the code sample), but you aren't going to get the performance increase you're hoping for.
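The sample code's MatrixMultiply2 isn't reproduced in this article, but the sketch below shows the kind of source-level tuning it stands for - writing into an out-parameter instead of returning a temporary, and caching the input components in locals. It assumes Vector4f's operator[] is writable, and shouldn't be taken as the sample's exact code:

// MatrixMultiply2_sketch -- plain C++ tuning: no temporary return value,
// and the input components are read into locals once.  Faster than
// MatrixMultiply1, but still nowhere near what SSE can do.
void MatrixMultiply2_sketch(Matrix4f &m, Vector4f &vin, Vector4f &vout)
{
	float x = vin[0], y = vin[1], z = vin[2], w = vin[3];
	vout[0] = m.elts[0][0]*x + m.elts[0][1]*y + m.elts[0][2]*z + m.elts[0][3]*w;
	vout[1] = m.elts[1][0]*x + m.elts[1][1]*y + m.elts[1][2]*z + m.elts[1][3]*w;
	vout[2] = m.elts[2][0]*x + m.elts[2][1]*y + m.elts[2][2]*z + m.elts[2][3]*w;
	vout[3] = m.elts[3][0]*x + m.elts[3][1]*y + m.elts[3][2]*z + m.elts[3][3]*w;
}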

Basic SSE

Why does Chris's version run so fast? Because it uses 3DNow! 3DNow is an example of a SIMD (Single Instruction, Multiple Data) instruction set, which lets you execute the same operation on multiple pieces of data simultaneously. Intel provides its own SIMD instructions, called SSE (Streaming SIMD Extensions), which are available on the Pentium 3 and up. SSE lets you perform operations on four 32-bit floats in parallel! (The Pentium 4's SSE2 instructions provide similar operations for pairs of 64-bit doubles, but most graphics applications don't need the extra precision.)
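If you've never seen SSE code before, here's about the smallest possible example - a sketch with made-up names, not part of the sample code - just to show what a packed instruction looks like: four pairs of floats added with a single addps.

// AddFourFloats (sketch) -- adds a[0..3] to b[0..3] and stores the four
// results in sum[0..3], using one packed addition.
void AddFourFloats(float *a, float *b, float *sum)
{
	__asm {
		mov			esi, a
		mov			edx, b
		mov			edi, sum
		movups	xmm0, [esi]		// load a[0..3] (unaligned load)
		movups	xmm1, [edx]		// load b[0..3]
		addps		xmm0, xmm1		// four additions in a single instruction
		movups	[edi], xmm0		// store the four sums
	}
}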

SIMD lends itself amazingly well to matrix operations. Intel's developer site contains descriptions of SSE-optimized matrix multiplication, inversion, and L-U decomposition algorithms. After familiarizing yourself with the SSE instructions (Intel provides an excellent tutorial/reference - see the links at the end of the article for more information), you might update your MatrixMultiply function to look something like this:

// MatrixMultiply3 -- a C++/ASM version of MatrixMultiply2, which takes
// advantage of Intel's SSE instructions.  This version requires that
// M be in row-major order.
//
// Performance: 57 cycles/vector
void MatrixMultiply3(Matrix4f &m, Vector4f *vin, Vector4f *vout)
{
	// Get a pointer to the elements of m
	float *row0 = m.Ref();

	__asm {
		mov			esi, vin
		mov			edi, vout

		// load columns of matrix into xmm4-7
		mov			edx, row0
		movups	xmm4, [edx]
		movups	xmm5, [edx+0x10]
		movups	xmm6, [edx+0x20]
		movups	xmm7, [edx+0x30]

		// load v into xmm0.
		movups	xmm0, [esi]

		// we'll store the final result in xmm2; initialize it
		// to zero
		xorps		xmm2, xmm2

		// broadcast x into xmm1, multiply it by the first
		// column of the matrix (xmm4), and add it to the total
		movups	xmm1, xmm0
		shufps	xmm1, xmm1, 0x00
		mulps		xmm1, xmm4
		addps		xmm2, xmm1

		// repeat the process for y, z and w
		movups	xmm1, xmm0
		shufps	xmm1, xmm1, 0x55
		mulps		xmm1, xmm5
		addps		xmm2, xmm1
		movups	xmm1, xmm0
		shufps	xmm1, xmm1, 0xAA
		mulps		xmm1, xmm6
		addps		xmm2, xmm1
		movups	xmm1, xmm0
		shufps	xmm1, xmm1, 0xFF
		mulps		xmm1, xmm7
		addps		xmm2, xmm1

		// write the results to vout
		movups	[edi], xmm2
	}
}

Note: due to the way that SSE instructions work, it ends up being convenient to have the input matrix be in row-major order. If you'd rather keep your matrices in column-major order (the way OpenGL expects them), don't panic - it turns out you can do it either way without a significant performance penalty. Keep reading.

Here's a general overview of the SSE version of the multiplier algorithm:

- Load matrix columns into SSE registers xmm4-xmm7.

- Load input vector into xmm0.

- Initialize the output vector (xmm2) to zero.

- For each component of the input vector:

   o Broadcast that component into xmm1 (i.e. replicate its value into all four slots of xmm1).

   o Multiply xmm1 by the appropriate column of the matrix (xmm4 for x, xmm5 for y, etc).

   o Add the results to the output vector, xmm2.

- Write xmm2 to memory.

Not bad - more than 1.5 times as fast as the original version! But there are still plenty of ways to make this function faster.

Batch Processing

Next, let's make a slight algorithmic change. We know matrix-vector multiplication is a common operation, but how often are we going to be transforming each vertex by a different matrix? Not very often - usually, we're doing things like transforming entire models into world-space, which involves multiplying lots of vertices by the exact same matrix (namely, the object's model matrix). This means we don't need to load the matrix before every vector - we can just load it once, and then feed the matrix a stream of vectors to be multiplied! With some slight modifications to our code, we end up with a BatchMultiply function:

// BatchMultiply1 -- A modification to MatrixMultiply3 in which we
// multiply several input vectors (vin) by the same matrix (m), storing the
// results in 'vout'.  A total of 'len' vectors are processed.  This
// prevents us from having to re-load 'm' every time through the loop.
// This also allows us to embed the transpose operation into the function
// body, so we can continue to store our matrices in column-major order,
// if we wish.
//
// Performance: 32 cycles/vector
void BatchMultiply1(Matrix4f &m, Vector4f *vin, Vector4f *vout, int len)
{
	// transpose the matrix into xmm4-7
	m.TransposeIntoXMM();
	static const int vecSize = sizeof(Vector4f);

	__asm {
		mov			esi, vin
		mov			edi, vout
		mov			ecx, len

BM1_START:
		// load the next input vector into xmm0, and advance the input
		// pointer
		movups	xmm0, [esi]
		add			esi, vecSize

		// we'll store the final result in xmm2; initialize it
		// to zero
		xorps		xmm2, xmm2

		// broadcast x into xmm1, multiply it by the first
		// column of the matrix (xmm4), and add it to the total
		movups	xmm1, xmm0
		shufps	xmm1, xmm1, 0x00
		mulps		xmm1, xmm4
		addps		xmm2, xmm1

		// repeat the process for y, z and w
		movups	xmm1, xmm0
		shufps	xmm1, xmm1, 0x55
		mulps		xmm1, xmm5
		addps		xmm2, xmm1
		movups	xmm1, xmm0
		shufps	xmm1, xmm1, 0xAA
		mulps		xmm1, xmm6
		addps		xmm2, xmm1
		movups	xmm1, xmm0
		shufps	xmm1, xmm1, 0xFF
		mulps		xmm1, xmm7
		addps		xmm2, xmm1

		// write the results to vout, and advance the output pointer
		movups	[edi], xmm2
		add			edi, vecSize
		dec			ecx
		jnz			BM1_START
	}
}

Better still! Note that since we're now processing many vectors at once, it's feasible to insert a transpose operation during the function initialization - the amortized cost across all the input vectors is virtually zero (and in any case, a pretty fast SIMDified matrix transpose function is included in the sample code). So it doesn't really matter whether you store your matrices in column-major or row-major order, so long as you remember to transpose them into row-major order before loading them into the SSE registers.
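The sample code's TransposeIntoXMM isn't listed in this article, but for the curious, a SIMDified 4x4 transpose can be built entirely out of movaps and shufps. The following sketch (the register assignments and function name are illustrative, not necessarily what the sample code does) leaves the transposed matrix in xmm4-7, which is exactly where the batch loops expect it:

// TransposeIntoXMM_sketch -- loads a 4x4 matrix and leaves its transpose
// in xmm4-7 (xmm4 = first column of the stored matrix, and so on).  Like
// the sample code, this relies on the compiler leaving xmm4-7 alone
// between this call and the batch loop that follows it.
void TransposeIntoXMM_sketch(Matrix4f &m)
{
	float *row0 = m.Ref();

	__asm {
		mov			edx, row0
		movups	xmm0, [edx]				// m00 m01 m02 m03
		movups	xmm1, [edx+0x10]	// m10 m11 m12 m13
		movups	xmm2, [edx+0x20]	// m20 m21 m22 m23
		movups	xmm3, [edx+0x30]	// m30 m31 m32 m33

		// interleave the low and high halves of each pair of rows
		movaps	xmm4, xmm0
		shufps	xmm4, xmm1, 0x44	// m00 m01 m10 m11
		movaps	xmm6, xmm0
		shufps	xmm6, xmm1, 0xEE	// m02 m03 m12 m13
		movaps	xmm5, xmm2
		shufps	xmm5, xmm3, 0x44	// m20 m21 m30 m31
		movaps	xmm7, xmm2
		shufps	xmm7, xmm3, 0xEE	// m22 m23 m32 m33

		// gather the transposed rows into xmm4-7
		movaps	xmm0, xmm4
		shufps	xmm4, xmm5, 0x88	// m00 m10 m20 m30
		shufps	xmm0, xmm5, 0xDD	// m01 m11 m21 m31
		movaps	xmm5, xmm0
		movaps	xmm1, xmm6
		shufps	xmm6, xmm7, 0x88	// m02 m12 m22 m32
		shufps	xmm1, xmm7, 0xDD	// m03 m13 m23 m33
		movaps	xmm7, xmm1
	}
}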

16-Byte Memory Alignment

Okay, next lesson: when looking through the SSE instruction reference, did you notice that there are two different instructions for data movement: movups (unaligned MOV) and movaps (aligned MOV)? If you tried using movaps, you probably ended up with a "privileged instruction" runtime error, which is why you settled on movups. As it turns out, movaps is much faster than movups, but it requires that its arguments be aligned on 16-byte boundaries when moving to/from memory (otherwise, the instruction throws an exception). You have to do a little extra work in order to get your matrices and vectors to be 16-byte aligned, but it's worth it.

The first thing I tried in Visual C++ was to go into Project -> Settings -> C/C++ -> Code Generation, and change the "struct member alignment" field from "8 bytes" to "16 bytes". Problem solved? No, apparently not. Changing this field doesn't appear to do anything useful (if anybody knows why, please explain it to me!). Instead, I had to resort to making changes in the code itself, using the C/C++ __declspec(align(16)) specifier. You can put this specifier in front of the definition of any variable, data member, struct or class in order to force the object to be 16-byte aligned (in the case of structs or classes declared as __declspec(align(16)), all instances of that struct/class will be properly aligned).

It's important to note that this method apparently doesn't work for objects (or arrays of objects) which are dynamically allocated (with malloc or new). Fortunately, the MSVC Processor Pack Upgrade includes an _aligned_malloc() function which should do the trick. Otherwise, it may be necessary to write your own memory management routines to make sure your objects are properly aligned. Email me if you'd like suggestions on how to set this up.
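Here's a minimal sketch of both approaches; the struct is deliberately simplified (the sample code's Vector4f has more to it), and the helper function names are mine:

#include <malloc.h>		// for _aligned_malloc / _aligned_free

// A 16-byte-aligned vector type (sketch).  Every static or automatic
// instance of this struct will sit on a 16-byte boundary.
__declspec(align(16)) struct AlignedVec4
{
	float elts[4];
};

// Heap allocations need _aligned_malloc instead of new/malloc:
AlignedVec4 *AllocVectors(int count)
{
	return (AlignedVec4 *)_aligned_malloc(count * sizeof(AlignedVec4), 16);
}

void FreeVectors(AlignedVec4 *p)
{
	_aligned_free(p);		// aligned allocations need the matching free
}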

After modifying my Matrix4f and Vector4f classes to be 16-byte aligned, simply replacing all instances of movups with movaps in our BatchMultiply function results in another speedup! Here's the code:

// BatchMultiply2 -- A simple modification to BatchMultiply1: we now use
// aligned moves (movaps) instead of unaligned moves (movups).  This is much
// faster, but requires that Matrix4f and Vector4f objects are aligned
// on 16-byte boundaries.  We use the __declspec(align(16)) specifier in
// the Matrix4f and Vector4f class definitions to accomplish this.
//
// Performance: 28 cycles/vector
void BatchMultiply2(Matrix4f &m, Vector4f *vin, Vector4f *vout, int len)
{
	// transpose the matrix into xmm4-7
	m.TransposeIntoXMM();
	static const int vecSize = sizeof(Vector4f);

	__asm {
		mov			esi, vin
		mov			edi, vout
		mov			ecx, len

BM2_START:
		// load the next input vector into xmm0, and advance the input
		// pointer
		movaps	xmm0, [esi]
		add			esi, vecSize

		// we'll store the final result in xmm2; initialize it
		// to zero
		xorps		xmm2, xmm2

		// broadcast x into xmm1, multiply it by the first
		// column of the matrix (xmm4), and add it to the total
		movaps	xmm1, xmm0
		shufps	xmm1, xmm1, 0x00
		mulps		xmm1, xmm4
		addps		xmm2, xmm1

		// repeat the process for y, z and w
		movaps	xmm1, xmm0
		shufps	xmm1, xmm1, 0x55
		mulps		xmm1, xmm5
		addps		xmm2, xmm1
		movaps	xmm1, xmm0
		shufps	xmm1, xmm1, 0xAA
		mulps		xmm1, xmm6
		addps		xmm2, xmm1
		movaps	xmm1, xmm0
		shufps	xmm1, xmm1, 0xFF
		mulps		xmm1, xmm7
		addps		xmm2, xmm1

		// write the results to vout, advance the output pointer,
		// and loop
		movaps	[edi], xmm2
		add			edi, vecSize
		dec			ecx
		jnz			BM2_START
	}
}

Instruction Pairing

Chris's articles go into a great deal of detail about instruction pairing, so I'll only summarize here. Basically, if you order your code properly, the Pentium III will execute multiple instructions at the same time! There are two main strategies you can use to ensure optimal instruction pairing.

First, try to eliminate data dependencies between adjacent instructions. If at all possible, avoid reading from and writing to the same register in adjacent instructions. In our case, that means instead of executing the "move / broadcast / multiply / add" sequence once for each component of the input vector, we execute all the moves, then all the broadcasts, then all the multiplies, and then all the adds. That way, neighboring instructions no longer depend on each other's results.

Second, it's important to understand the instruction decoding mechanism of the processor. If your instructions are ordered properly (and free of data dependencies), the CPU can decode up to three instructions in the same clock cycle! Chris Dragan's article in Hugi #21 explains the Pentium III's instruction decoder quite well, so I won't go into too much detail. However, the most important practical lesson of the article is that only one packed SSE instruction (i.e. one whose opcode ends in 'ps', operating on more than one value at once) can be decoded per cycle. Long strings of packed SSE instructions can be a major performance bottleneck - so much potential parallelism goes to waste! Wherever possible, try to alternate between packed and non-packed instructions.

Here's a version of BatchMultiply whose instructions have been re-ordered to improve performance:

// BatchMultiply3 -- A modification to BatchMultiply2 which makes better
// use of instruction pairing.
//
// Performance: 22 cycles/vector
void BatchMultiply3(Matrix4f &m, Vector4f *vin, Vector4f *vout, int len)
{
	// transpose the matrix into xmm4-7
	m.TransposeIntoXMM();
	static const int vecSize = sizeof(Vector4f);

	__asm {
		mov			esi, vin
		mov			edi, vout
		mov			ecx, len

BM3_START:
		// load the next input vector into xmm0, and advance the input
		// and output pointers
		movaps	xmm0, [esi]
		add			edi, vecSize

		// broadcast y into xmm1, z into xmm2, and w into xmm3 (leaving
		// x in xmm0).
		movaps	xmm1, xmm0
		add			esi, vecSize
		movaps	xmm2, xmm0
		movaps	xmm3, xmm0
		shufps	xmm0, xmm0, 0x00
		shufps	xmm1, xmm1, 0x55
		shufps	xmm2, xmm2, 0xAA
		shufps	xmm3, xmm3, 0xFF
		
		// multiply xmm0-3 by the appropriate columns of the matrix
		mulps		xmm0, xmm4
		mulps		xmm1, xmm5
		mulps		xmm2, xmm6
		mulps		xmm3, xmm7

		// sum the results into xmm1
		addps		xmm1, xmm0
		addps		xmm2, xmm3
		addps		xmm1, xmm2

		// write the results to vout, and loop
		movaps	[edi-0x10], xmm1
		dec			ecx
		jnz			BM3_START
	}
}

Prefetching

The next optimization is more subtle. In addition to the SIMD instructions we've been using so far, SSE also provides instructions that let you specify how you want your data to interact with the system cache. The Intel SSE tutorial provides full coverage of these instructions, but the one that interests us right now is the prefetchnta instruction. It allows us to pre-load the data at a given address into a special non-temporal area of the L1 cache, which is guaranteed not to displace any data that was loaded into the cache for more legitimate reasons (such as temporal or spatial locality). Essentially, we use prefetchnta to indicate that we want the data at the specified address to be in the cache so that we can read from (or write to) it exactly once, at which point it won't be used again.

This is exactly how we want to treat our streams of input and output vectors - we only want each vector to be in the cache long enough to be read from (or written to). Then we want it GONE, so that we have plenty of cache space for future vectors! So we add a pair of prefetchnta instructions to BatchMultiply - one for the input stream, and one for the output stream:

// BatchMultiply4 -- A modification to BatchMultiply3 which uses 
// SSE prefetching instructions to improve performance with large
// input sets.
//
// Performance: 21 cycles/vector
void BatchMultiply4(Matrix4f &m, Vector4f *vin, Vector4f *vout, int len)
{
	// transpose the matrix into xmm4-7
	m.TransposeIntoXMM();
	static const int vecSize = sizeof(Vector4f);

	__asm {
		mov			esi, vin
		mov			edi, vout
		mov			ecx, len

BM4_START:
		// load the next input vector into xmm0, and advance the input
		// pointer.  Prefetch upcoming vectors into the cache
		movaps	xmm0, [esi]
		prefetchnta	[esi+0x30]

		// broadcast y into xmm1, z into xmm2, and w into xmm3 (leaving
		// x in xmm0).
		movaps	xmm1, xmm0
		add			esi, vecSize
		movaps	xmm2, xmm0
		add			edi, vecSize
		movaps	xmm3, xmm0
		prefetchnta [edi+0x30]
		shufps	xmm0, xmm0, 0x00
		shufps	xmm1, xmm1, 0x55
		shufps	xmm2, xmm2, 0xAA
		shufps	xmm3, xmm3, 0xFF
		
		// multiply xmm0-3 by the appropriate columns of the matrix
		// (the pointer increments are hidden up among the broadcasts)
		mulps		xmm0, xmm4
		mulps		xmm1, xmm5
		mulps		xmm2, xmm6
		mulps		xmm3, xmm7

		// sum the results into xmm1
		addps		xmm1, xmm0
		addps		xmm2, xmm3
		addps		xmm1, xmm2

		// write the results to vout, and loop
		movaps	[edi-0x10], xmm1
		dec			ecx
		jnz			BM4_START
	}
}

At first, the benefits of prefetching aren't apparent - we only saved a single cycle! However, once we start passing in larger streams of vectors (around 10000), the prefetching version of the code begins to perform significantly better than the non-prefetching version.

Technically, we should be prefetching data which is significantly ahead of the data we're currently operating on. You should experiment to find the specific prefetching offset that provides the best performance for your application (48 bytes seems to be the "sweet spot" for the Pentium 3). It might also be worth investigating the movntps (streaming store) instruction, which is used to write data from SSE registers directly into memory while bypassing the cache entirely (thus further avoiding cache pollution). See the Intel tutorial for more information.
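movntps isn't used in the multipliers in this article, but the idiom is simple enough to sketch on its own. The toy routine below (the names are illustrative, and like the article's other loops it assumes len is at least 1) copies a stream of aligned vectors with non-temporal stores; the sfence at the end makes sure all pending streaming stores have reached memory before anything else reads the destination:

// StreamCopy (sketch) -- copies 'len' aligned Vector4f's from src to dst
// using non-temporal stores, so the destination never pollutes the cache.
void StreamCopy(Vector4f *src, Vector4f *dst, int len)
{
	static const int vecSize = sizeof(Vector4f);

	__asm {
		mov			esi, src
		mov			edi, dst
		mov			ecx, len

SC_START:
		movaps	xmm0, [esi]			// aligned load from the source
		add			esi, vecSize
		movntps	[edi], xmm0			// non-temporal store -- bypasses the cache
		add			edi, vecSize
		dec			ecx
		jnz			SC_START

		sfence									// flush the pending streaming stores
	}
}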

To see the benefits of prefetching, run the sample program simdtest.exe with the -f flag. This will flush the entire cache between each test. You'll see that prefetching can double the speed of your code in the worst case. This is a useful technique to remember, since the prefetching routines can be used in non-SIMD code as well!

Increase Temporal Locality of Memory I/O

Our next optimization technique warrants another nod to Chris Dragan. He sped up his 3DNow multiplier from 19 cycles to 16 cycles by operating on two adjacent vectors concurrently! While this didn't change the amount of work he had to do on each vector, it reduced the overall amount of time spent waiting for memory I/O operations to complete. Reading two objects from memory one right after the other is faster than reading one object, then running a lot of code, then reading the other object. With a bit of re-organization, we can apply a similar method to our SSE multiplier.

First we need to reduce the number of registers we're using per vector. We're using four of the eight SSE registers to store the matrix columns, one register for the input vector, one for the output vector, and either one or two for scratch values. This won't do at all if we want to work on two vectors at once; we need at least two registers to store the output vectors, leaving us only two registers for the input vectors and scratch values! It looks like we're stuck.

Well, not quite - we don't actually need to store the input vectors in registers. Instead, we can load individual components of the input vectors directly from memory to the scratch registers, and then broadcast, multiply and add them as usual! To load individual floats instead of whole vectors, we use the movss instruction. As an added bonus, movss is a non-packed instruction, which allows us to make better use of the parallel instruction decoders (see above)! Here's a version of BatchMultiply which operates on two vectors at once:

// BatchMultiply5 -- A modified version of BatchMultiply4 which loads
// vector components individually from memory, thereby allowing us
// to work on TWO VECTORS SIMULTANEOUSLY!
//
// Performance: 20 cycles/vector
void BatchMultiply5(Matrix4f &m, Vector4f *vin, Vector4f *vout, int len)
{
	// initializations in C++ land
	Matrix4f mt(m, Matrix4f::TRANSPOSE); // work from a transposed copy of m
	float *row0 = mt.Ref();
	static const int vecSize = 2 * sizeof(Vector4f);
	
	// if there are an odd number of vectors, process the first one
	// separately and advance the pointers
	if (len & 0x1) {
		MatrixMultiply3(mt, vin, vout);
		++vin;
		++vout;
	}
	len >>= 1; // we process two vectors at a time

	__asm {
		mov			esi, vin
		mov			edi, vout
		mov			ecx, len

		// load columns of matrix into xmm4-7
		mov			edx, row0
		movaps	xmm4, [edx]
		movaps	xmm5, [edx+0x10]
		movaps	xmm6, [edx+0x20]
		movaps	xmm7, [edx+0x30]

BM5_START:
		
		// process x
		movss		xmm1, [esi+0x00]
		movss		xmm3, [esi+0x10]
		shufps	xmm1, xmm1, 0x00
		prefetchnta	[esi+0x30]
		shufps	xmm3, xmm3, 0x00
		mulps		xmm1, xmm4
		prefetchnta [edi+0x30]
		mulps		xmm3, xmm4

		// process y
		movss		xmm0, [esi+0x04]
		movss		xmm2, [esi+0x14]
		shufps	xmm0, xmm0, 0x00
		shufps	xmm2, xmm2, 0x00
		mulps		xmm0, xmm5
		mulps		xmm2, xmm5
		addps		xmm1, xmm0
		addps		xmm3, xmm2

		// process z
		movss		xmm0, [esi+0x08]
		movss		xmm2, [esi+0x18]
		shufps	xmm0, xmm0, 0x00
		shufps	xmm2, xmm2, 0x00
		mulps		xmm0, xmm6
		mulps		xmm2, xmm6
		addps		xmm1, xmm0
		addps		xmm3, xmm2

		// process w (hiding some pointer increments between the
		// multiplies)
		movss		xmm0, [esi+0x0C]
		movss		xmm2, [esi+0x1C]
		shufps	xmm0, xmm0, 0x00
		shufps	xmm2, xmm2, 0x00
		mulps		xmm0, xmm7
		add			esi, vecSize
		mulps		xmm2, xmm7
		add			edi, vecSize
		addps		xmm1, xmm0
		addps		xmm3, xmm2

		// write output vectors to memory, and loop
		movaps	[edi-0x20], xmm1
		movaps	[edi-0x10], xmm3
		dec			ecx
		jnz			BM5_START
	}
}

Of course, there's some additional overhead involved before we enter the main loop: we need to account for the case where we're passed an odd number of vectors. But as always, initialization costs are amortized away as the size of the data set increases, so for a reasonable number of input vectors the impact of this extra setup is negligible.

Application-Specific Specialization

There's one more significant change we can make: by making some assumptions about the input vectors, we can shave off a few more cycles. Until now, we've been working with a general-purpose 4x4 matrix-vector multiplier. But remember that our original application involved transforming 3D vertices. 3D vertices are usually stored as four-element vectors, where the fourth element w is the "homogeneous coordinate", used for perspective correction during the final transformation process. The value of w is initially 1.0, and it almost never changes (until the final stages of the rasterization pipeline, at which point the vertex is out of our control anyway). By assuming that w = 1.0 in the input vectors, we don't need to broadcast w and multiply it by xmm7; we can just add xmm7 directly to the output vector! Thus, we can eliminate three instructions (a move, a shuffle and a multiply) from our function (I'll call the new version BatchTransform, to differentiate it from the slower but more general BatchMultiply):

// BatchTransform1 -- A modified version of BatchMultiply5 which makes
// an additional assumption about the vectors in vin: if each vector's
// 4th element (the homogeneous coordinate w) is assumed to be 1.0 (as is
// the case for 3D vertices), we can eliminate a move, a shuffle and a
// multiply instruction.
//
// Performance: 17 cycles/vector
void BatchTransform1(Matrix4f &m, Vector4f *vin, Vector4f *vout, int len)
{
	// initializations in C++ land
	Matrix4f mt(m, Matrix4f::TRANSPOSE); // work from a transposed copy of m
	float *row0 = mt.Ref();
	static const int vecSize = 2 * sizeof(Vector4f);
	
	// if there are an odd number of vectors, process the first one
	// separately and advance the pointers
	if (len & 0x1) {
		MatrixMultiply3(mt, vin, vout);
		++vin;
		++vout;
	}
	len >>= 1; // we process two vectors at a time

	__asm {
		mov		esi, vin
		mov		edi, vout
		mov		ecx, len

		// load columns of matrix into xmm4-7
		mov		edx, row0
		movaps	xmm4, [edx]
		movaps	xmm5, [edx+0x10]
		movaps	xmm6, [edx+0x20]
		movaps	xmm7, [edx+0x30]

BT2_START:
		// process x (hiding the prefetches in the delays)
		movss		xmm1, [esi+0x00]
		movss		xmm3, [esi+0x10]
		shufps	xmm1, xmm1, 0x00
		prefetchnta [edi+0x30]
		shufps	xmm3, xmm3, 0x00
		mulps		xmm1, xmm4
		prefetchnta	[esi+0x30]
		mulps		xmm3, xmm4

		// process y
		movss		xmm0, [esi+0x04]
		movss		xmm2, [esi+0x14]
		shufps	xmm0, xmm0, 0x00
		shufps	xmm2, xmm2, 0x00
		mulps		xmm0, xmm5
		mulps		xmm2, xmm5
		addps		xmm1, xmm0
		addps		xmm3, xmm2

		// process z (hiding some pointer arithmetic between
		// the multiplies)
		movss		xmm0, [esi+0x08]
		movss		xmm2, [esi+0x18]
		shufps	xmm0, xmm0, 0x00
		shufps	xmm2, xmm2, 0x00
		mulps		xmm0, xmm6
		add			esi, vecSize
		mulps		xmm2, xmm6
		add			edi, vecSize
		addps		xmm1, xmm0
		addps		xmm3, xmm2

		// process w
		addps		xmm1, xmm7
		addps		xmm3, xmm7

		// write output vectors to memory and loop
		movaps	[edi-0x20], xmm1
		movaps	[edi-0x10], xmm3
		dec			ecx
		jnz			BT2_START
	}
}

Conclusion and Future Possibilities

Ideally, you want both 3DNow and SSE optimized versions of certain algorithms in your program (in addition to a non-optimized version, for testing or portability). But you certainly don't want to litter your code with conditional statements to choose the right version to use at runtime. Compiling a separate version of your program for each processor variant is obviously the optimal solution performance-wise, but it's a tad inelegant (and it's easy to get the different versions confused). Using the Strategy design pattern, we can create a single monolithic application which runs the appropriate code on all supported processors, without the performance or readability costs of scattering conditional statements everywhere. A sketch of this approach follows.
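Here's a sketch of the idea. The C++ fallback and 3DNow routine are assumed to exist elsewhere and their names are illustrative; the CPUID check shown covers SSE only, and a 3DNow check would slot in the same way:

// All versions share one signature, so we can call them through a pointer.
typedef void (*BatchMultiplyFn)(Matrix4f &m, Vector4f *vin, Vector4f *vout, int len);

// assumed to exist elsewhere (not shown in this article):
void BatchMultiplyCpp(Matrix4f &m, Vector4f *vin, Vector4f *vout, int len);
void BatchMultiply3DNow(Matrix4f &m, Vector4f *vin, Vector4f *vout, int len);

// CpuHasSSE (sketch) -- CPUID function 1 reports SSE support in bit 25 of EDX.
bool CpuHasSSE()
{
	unsigned int features = 0;
	__asm {
		mov			eax, 1
		cpuid
		mov			features, edx
	}
	return (features & (1 << 25)) != 0;
}

// Pick the best routine once, at startup...
BatchMultiplyFn ChooseBatchMultiply()
{
	if (CpuHasSSE())
		return BatchMultiply5;			// the SSE version from this article
	// (a 3DNow check and BatchMultiply3DNow would go here)
	return BatchMultiplyCpp;			// portable fallback
}

// ...and call it through the pointer everywhere else:
//   static const BatchMultiplyFn BatchMultiply = ChooseBatchMultiply();
//   BatchMultiply(modelMatrix, verts, xformedVerts, numVerts);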

It's also worth noting that BatchMultiply5(mat, vin, vout, 4) - with vin and vout pointing to arrays of four Vector4f's - is a fast, general-purpose drop-in replacement for 4x4 matrix*matrix multiplication!
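For example, the following sketch assumes (as the rest of the article's asm already does) that Matrix4f stores its sixteen floats contiguously starting at Ref():

Matrix4f a, b, product;

// treat b's four rows as four input vectors, and product's as the outputs
BatchMultiply5(a, (Vector4f *)b.Ref(), (Vector4f *)product.Ref(), 4);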

List of links and resources:
- Intel's Small Matrix Library
- Intel's SSE tutorial
- Intel's compiler
- Agner Fog's Pentium Optimization Document
- libSIMD

Thanks to Mark Larson for his suggestions and feedback!

Cort Stratton